Abstract

The multiple change point problem is considered in the most general
setting, where the only assumption made on the time series distributions
generating the data is that they are stationary ergodic. No modeling,
independence or parametric assumptions are made. While the need for such
a general setting is dictated by real applications, the problem of change
point estimation becomes a difficult unsupervised learning problem. In
this work a novel algorithm for solving this problem is proposed, and it
is shown to be asymptotically consistent under the general assumptions
considered.

1 Introduction

We are given a sequence $X_1, \dots, X_n$. The sequence is composed of $\kappa + 1$
non-overlapping segments such that the process distributions that generate every
pair of consecutive segments are different. The index where one segment ends
and another starts is called a change point. Thus, we have $\kappa$ unknown change
points and our goal is to estimate them.

Change point analysis is one of the core problems in classical mathematical 
statistics [TJ[llinilll[Sl[Sl[7]. A typical formulation of the change point problem 
is that in each segment the points are independent and identically distributed 
(i.i.d.) and the change refers to a change in the mean (that is, $X_i$, $i = 1..n$ in
different segments have different means). While more general frameworks have
also been considered, even in nonparametric settings the approaches are mostly 
based on strong assumptions on the form of the change and dependence [3J |3] . 
However, such strong assumptions do not necessarily hold in most real-world
applications such as bioinformatics, network traffic, market analysis,
audio/video segmentation and fraud detection. Methods used in these applications
are thus usually model-based or employ application-specific ad-hoc algorithms.
In particular, a theoretical framework that would clarify what
is possible and under which assumptions is entirely lacking.

In this paper we consider the following general setting. The distributions
that generate the data are unknown; our only assumption is that they are
stationary ergodic. This general assumption on the distributions allows for the
data to be arbitrarily dependent. In particular, we do not require the sequences
before and after the change points to be independent. The samples themselves are
allowed to be dependent, and this dependence can have an unknown form and
structure; it may even be adversarial. Moreover, the marginal distributions
before and after the change points are allowed to be the same.
Results. We provide a novel nonparametric multiple change point estimation
algorithm for time-series data. We further demonstrate that the proposed
algorithm is asymptotically consistent in the general setting described above. The
number of change points $\kappa$ is assumed known, but the number of distributions
is unknown (thus, it ranges from 2 to $\kappa + 1$).

In the general setting of highly-dependent time series, correct estimation
of the number of change points is provably impossible, even in the weakest
asymptotic sense, and even if there is at most one change [5]. While a popular
mitigation is to consider more restrictive settings, in this work we are interested
in intermediate formulations with asymptotically consistent solutions under the
most general assumptions. In particular, we assume that the correct number $\kappa$
of change points is known and provided to the algorithm. The particular case of
$\kappa = 1$ has been considered in [5] where a simple consistent algorithm to estimate
one change point is provided. It turns out that the general case of $\kappa > 1$ is much
more complex. Since the number of change points is more than one, there exists
at least one segment somewhere in the middle of the sequence that lies between
a pair of change points, and whose length can be arbitrarily small (even though
we assume that the length of each segment is asymptotically linear in $n$, there
is no a priori lower bound on it). Thus we need to be able to simultaneously
analyze all the segments of the sequence $X_1, \dots, X_n$ of arbitrarily small lengths.
Usually in statistics, this problem is mitigated via tools based on the speed of
convergence of sample averages to expectations. In the context of stationary
ergodic processes, such tools are unavailable as no guarantees on the speed of
convergence exist. Hence, the simultaneous analysis of segments of arbitrarily
small lengths is conceptually much more difficult.

We overcome this problem by combining many different change point
estimates, each of which assumes some lower bound on the distance between the
change points. For each change point a final estimate is given as a weighted
combination of the estimates. The weights are designed to reflect the performance
of each estimate. This approach may be reminiscent of prediction with expert
advice [10] with the difference that in the framework we consider, performance
cannot be measured directly.

Although the main results of this work are theoretical, all the methods we
present can be computed efficiently. Our methods are based on empirical
estimates of the so-called distributional distance. The distributional distance is
a well-known metric in statistics [11] for which a consistent empirical estimate
was recently proposed in [S] . This distance has proved useful both theoretically 
and practically in various learning problems involving stationary ergodic time 
series [H [HI HI] • 
Related Work. Most of the existing literature on nonparametric change point
estimation involves considerably more restrictive settings. For example, the
additional assumptions usually made in nonparametric settings include that the
samples are i.i.d. in each of the segments [13 [HI [T71 [H], or that the distribu- 
tions obey some strong conditions on the nature of the dependence (e.g. are 
strongly mixing) [THJ [23 [S] : or that they belong to some specific model class 
(such as Hidden Markov processes) [35| [23] . In these frameworks the problem 
of estimating the number of change points is usually addressed with penalized 
criteria, see, for example, [231 [IS]. Moreover, it is almost exclusively assumed 
that the single-dimensional marginal distributions are different [5]. What
distinguishes our work from the related literature is that, first, we do not require
any fixed-size finite-dimensional marginals before and after the change points
to be different. Second, we do not make any assumptions on the structure of
dependence (no independence, memory or mixing assumptions). Our only
assumption is that the unknown distributions generating the data are stationary
ergodic.

Organization. In Section 2 we introduce some notation and definitions. In
Section 3 we formalize the problem. In Section 4 we present our algorithm and
informally explain how it works. We also provide a brief discussion on its
computational complexity. In Section 5 we provide some concluding remarks
and future directions. Finally, in Section 6 we prove the consistency of the algorithm.

2 Notation and definitions 

Let $\mathcal{X}$ be some measurable space (the domain); in this work we let $\mathcal{X} = \mathbb{R}$,
but extensions to more general spaces are straightforward. For a sequence
$X_1, \dots, X_n$ we use the abbreviation $X_{1..n}$. Consider the Borel $\sigma$-algebra $\mathcal{B}$
on $\mathcal{X}^\infty$ generated by the cylinders $\{B \times \mathcal{X}^\infty : B \in B^{m,l},\ m, l \in \mathbb{N}\}$, where the
sets $B^{m,l}$, $m, l \in \mathbb{N}$ are obtained via the partitioning of $\mathcal{X}^m$ into cubes of
dimension $m$ and volume $2^{-ml}$ (starting at the origin). Let also $B^m := \cup_{l \in \mathbb{N}} B^{m,l}$.
Processes are probability measures on the space $(\mathcal{X}^\infty, \mathcal{B})$. For $\mathbf{x} = X_{1..n} \in \mathcal{X}^n$
and $B \in B^m$ let $\nu(\mathbf{x}, B)$ denote the frequency with which $\mathbf{x}$ falls in $B$, i.e.

$$\nu(\mathbf{x}, B) := \frac{\mathbb{I}\{n \ge m\}}{n - m + 1} \sum_{i=1}^{n-m+1} \mathbb{I}\{X_{i..i+m-1} \in B\} \qquad (1)$$

A process $\rho$ is stationary if for any $i, j \in 1..n$ and $B \in B^m$, $m \in \mathbb{N}$, we have
$\rho(X_{1..j} \in B) = \rho(X_{i..i+j-1} \in B)$. A stationary process $\rho$ is called (stationary)
ergodic if for all $B \in \mathcal{B}$ we have $\lim_{n \to \infty} \nu(X_{1..n}, B) = \rho(B)$ almost surely. The
distributional distance between a pair of processes $\rho_1$ and $\rho_2$ is defined as follows

$$d(\rho_1, \rho_2) := \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} |\rho_1(B) - \rho_2(B)|$$

where $w_i := 2^{-i}$, $i \in \mathbb{N}$. Note that any summable sequence of positive weights
also works. For more on the distributional distance and its properties see [11].



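We use empirical estimates of this distance. Specifically, the empirical
estimate of the distance between a sequence $\mathbf{x} = X_{1..n} \in \mathcal{X}^n$, $n \in \mathbb{N}$ and a process
$\rho$ is given as

$$\hat d(\mathbf{x}, \rho) := \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} |\nu(\mathbf{x}, B) - \rho(B)| \qquad (2)$$

and that between the sequences $\mathbf{x}_i \in \mathcal{X}^{n_i}$, $n_i \in \mathbb{N}$, $i = 1, 2$ is

$$\hat d(\mathbf{x}_1, \mathbf{x}_2) := \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} |\nu(\mathbf{x}_1, B) - \nu(\mathbf{x}_2, B)|. \qquad (3)$$

Although expressions (2) and (3) involve infinite summations, they can be
computed efficiently [9].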
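As a minimal sketch of how the empirical distance (3) might be computed, the Python code below truncates both sums at hypothetical cut-offs `m_max` and `l_max`, and assumes the weights $w_i = 2^{-i}$ and real-valued samples. The function names are illustrative only and are not taken from [9]; a cell of $B^{m,l}$ is identified by the integer coordinates of the cube of side $2^{-l}$ that contains a window of $m$ consecutive samples.

```python
from collections import Counter
from math import floor


def cell_frequencies(x, m, l):
    """Empirical frequencies nu(x, B) for the non-empty cells B of B^{m,l}."""
    windows = len(x) - m + 1
    if windows <= 0:  # I{n >= m} = 0: every frequency is zero
        return {}
    scale = 2 ** l  # cubes of side 2^-l, starting at the origin
    counts = Counter(
        tuple(floor(v * scale) for v in x[i:i + m]) for i in range(windows)
    )
    return {cell: c / windows for cell, c in counts.items()}


def empirical_distance(x1, x2, m_max=3, l_max=3):
    """Truncated version of (3); m_max and l_max are assumed cut-offs."""
    total = 0.0
    for m in range(1, m_max + 1):
        for l in range(1, l_max + 1):
            f1 = cell_frequencies(x1, m, l)
            f2 = cell_frequencies(x2, m, l)
            cells = set(f1) | set(f2)  # cells empty in both contribute 0
            diff = sum(abs(f1.get(c, 0.0) - f2.get(c, 0.0)) for c in cells)
            total += 2.0 ** (-m) * 2.0 ** (-l) * diff
    return total
```

Only cells visited by at least one of the two sequences are touched, which is why the inner sum over $B^{m,l}$ is finite in practice.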

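The triangle inequality holds for the distributional distance $d(\cdot,\cdot)$ as well as
its empirical estimates $\hat d(\cdot,\cdot)$; for all distributions $\rho_i$, $i = 1..3$ and all sequences
$\mathbf{x}_i \in \mathcal{X}^{n_i}$, $n_i \in \mathbb{N}$, $i = 1..3$ we have

$$d(\rho_1, \rho_2) \le d(\rho_1, \rho_3) + d(\rho_2, \rho_3)$$
$$\hat d(\mathbf{x}_1, \mathbf{x}_2) \le \hat d(\mathbf{x}_1, \mathbf{x}_3) + \hat d(\mathbf{x}_2, \mathbf{x}_3)$$
$$\hat d(\mathbf{x}_1, \rho_1) \le \hat d(\mathbf{x}_1, \rho_2) + d(\rho_1, \rho_2).$$

The distributional distance $d(\cdot,\cdot)$ and its empirical estimates $\hat d(\cdot,\cdot)$ are
convex functions; that is, for every $\alpha \in (0,1)$ we have

$$d(\rho_1, \alpha\rho_2 + (1-\alpha)\rho_3) \le \alpha d(\rho_1, \rho_2) + (1-\alpha) d(\rho_1, \rho_3)$$
$$\hat d(\mathbf{x}_1, \alpha\mathbf{x}_2 + (1-\alpha)\mathbf{x}_3) \le \alpha \hat d(\mathbf{x}_1, \mathbf{x}_2) + (1-\alpha) \hat d(\mathbf{x}_1, \mathbf{x}_3)$$
$$\hat d(\rho, \alpha\mathbf{x}_1 + (1-\alpha)\mathbf{x}_2) \le \alpha \hat d(\rho, \mathbf{x}_1) + (1-\alpha) \hat d(\rho, \mathbf{x}_2)$$

for all distributions $\rho$, $\rho_i$, $i = 1..3$ and all sequences $\mathbf{x}_i \in \mathcal{X}^{n_i}$, $n_i \in \mathbb{N}$, $i = 1..3$.
As shown in [5], the estimates $\hat d(\cdot,\cdot)$ are asymptotically consistent: for sequences
$\mathbf{x}_1 \in \mathcal{X}^{n_1}$ and $\mathbf{x}_2 \in \mathcal{X}^{n_2}$, generated by stationary ergodic distributions $\rho_i$,
$i = 1, 2$ we have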



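$$\lim_{n_1, n_2 \to \infty} \hat d(\mathbf{x}_1, \mathbf{x}_2) = d(\rho_1, \rho_2), \quad \rho\text{-a.s., and} \qquad (4)$$

$$\lim_{n_i \to \infty} \hat d(\mathbf{x}_i, \rho_j) = d(\rho_i, \rho_j), \quad i, j \in 1, 2, \quad \rho\text{-a.s.} \qquad (5)$$

A more general estimate of $d(\cdot,\cdot)$ may be obtained as

$$\tilde d(\mathbf{x}_1, \mathbf{x}_2) := \sum_{m=1}^{m_n} \sum_{l=1}^{l_n} w_m w_l \sum_{B \in B^{m,l}} |\nu(\mathbf{x}_1, B) - \nu(\mathbf{x}_2, B)| \qquad (6)$$

where $m_n$ and $l_n$ are any sequences of integers that go to infinity with $n$. As
shown in [13] the consistency results for $\hat d(\cdot,\cdot)$, i.e. (4), (5), equally hold for
$\tilde d(\cdot,\cdot)$ so long as $m_n$, $l_n$ go to infinity with $n$.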



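Let $\mathbf{x} = X_{1..n}$ be a sequence and consider a subsequence $X_{a..b}$ of $\mathbf{x}$ with $a < b \in
1..n$. Define the intra-subsequence distance of $X_{a..b}$ as

$$\Delta_{\mathbf{x}}(a, b) := \hat d\left(X_{a..\lfloor \frac{a+b}{2} \rfloor}, X_{\lceil \frac{a+b}{2} \rceil..b}\right), \qquad (7)$$

and the single-change-point-estimator of $X_{a..b}$, $a < b$ as

$$\Phi_{\mathbf{x}}(a, b, \alpha) := \operatorname*{argmax}_{t \in [a,b]} \hat d(X_{a - n\alpha..t}, X_{t..b + n\alpha}), \quad \alpha \in (0,1). \qquad (8)$$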
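The intra-subsequence distance and the single-change-point-estimator defined above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the distance `dist` is a truncated empirical distributional distance with hypothetical cut-offs `m_max` and `l_max` and weights $2^{-(m+l)}$, indices are 0-based with half-open slices, a single integer midpoint replaces the floor/ceiling pair in (7), and the padded window in (8) is clipped to the sequence boundaries.

```python
from collections import Counter
from math import floor


def dist(x1, x2, m_max=2, l_max=3):
    """Truncated empirical distributional distance with weights 2^-(m+l)."""
    total = 0.0
    for m in range(1, m_max + 1):
        for l in range(1, l_max + 1):
            freqs = []
            for x in (x1, x2):
                windows = max(len(x) - m + 1, 0)
                counts = Counter(
                    tuple(floor(v * 2 ** l) for v in x[i:i + m])
                    for i in range(windows)
                )
                freqs.append({c: k / windows for c, k in counts.items()})
            f1, f2 = freqs
            cells = set(f1) | set(f2)
            total += 2.0 ** -(m + l) * sum(
                abs(f1.get(c, 0.0) - f2.get(c, 0.0)) for c in cells
            )
    return total


def intra_distance(x, a, b):
    """Delta_x(a, b): distance between the two halves of x[a:b]."""
    mid = (a + b) // 2
    return dist(x[a:mid], x[mid:b])


def single_change_point(x, a, b, alpha):
    """Phi_x(a, b, alpha): the split point t in [a, b] that maximises the
    distance between x[a - n*alpha : t] and x[t : b + n*alpha]."""
    n = len(x)
    pad = int(n * alpha)
    lo, hi = max(0, a - pad), min(n, b + pad)  # clipping is an assumption
    return max(range(a, b + 1), key=lambda t: dist(x[lo:t], x[t:hi]))
```

On a toy sequence that switches from one constant value to another, `intra_distance` is zero on a window that lies within one regime and positive on a window that straddles the switch, while `single_change_point` returns the switch index.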

3 Problem formulation 

The multiple change point estimation problem can be formalized as follows. We
are given a sequence

$$\mathbf{x} := X_1, \dots, X_n \in \mathcal{X}^n$$

formed as the concatenation of some $\kappa + 1$ sequences

$$X_{1..\pi_1},\ X_{\pi_1+1..\pi_2},\ \dots,\ X_{\pi_\kappa+1..n}.$$

Each of these sequences is generated by an unknown stationary ergodic process
distribution. Moreover, the consecutive sequences are generated by two
different distributions. The distributions are not required to be independent. The
parameters $\pi_k$ are unknown and have to be estimated; they are called change
points. Thus, a change point is an index between 1 and $n$ such that the sequences
before and after it are generated by different process distributions. Note that
we do not require the means, variances or single-dimensional marginals of the
distributions to be different. We consider the general scenario where the process
distributions are different.

A change point estimator is a function that takes a sequence $\mathbf{x}$ and a
parameter $\kappa$ and outputs a set $\{\hat\pi_1, \dots, \hat\pi_\kappa\} \subset \{1..n\}$ of estimated change points.
A change point estimator is asymptotically consistent if with probability 1 we
have

$$\lim_{n \to \infty} \sup_{k = 1..\kappa} \frac{1}{n} |\hat\pi_k - \pi_k| = 0.$$

To construct consistent algorithms, we assume that the change points $\pi_k$ are
linear in $n$, i.e. $\pi_k := n\theta_k$ where $\theta_k \in (0,1)$, $k = 1..\kappa$ are unknown. We also
define

$$\theta := \min_{k = 1..\kappa+1} \theta_k - \theta_{k-1}$$

where $\theta_0 := 0$ and $\theta_{\kappa+1} := 1$, and assume $\theta > 0$. The reason for these linearity
conditions is that the consistency properties we are after are asymptotic in $n$. If
the length of one of the sequences is constant or sublinear in $n$ then asymptotic
consistency is impossible in this setting.
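For concreteness, the two quantities introduced in this section, the minimum normalized segment length and the normalized estimation error that appears in the definition of consistency, can be computed as in the following short Python sketch. The function names are hypothetical, and estimates are matched to true change points by sorted order, which is an assumption made for the illustration.

```python
def min_separation(thetas):
    """Smallest gap between consecutive normalized change points,
    with the end points 0 and 1 added."""
    pts = [0.0] + sorted(thetas) + [1.0]
    return min(b - a for a, b in zip(pts, pts[1:]))


def normalized_error(estimates, true_points, n):
    """Largest deviation |estimate - true| over all change points, divided by n."""
    pairs = zip(sorted(estimates), sorted(true_points))
    return max(abs(e - p) for e, p in pairs) / n
```

A consistent estimator is one for which `normalized_error` tends to zero as `n` grows, with probability one.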



4 Main Results 

We present via Algorithm 1 a multiple change point estimation procedure which
we show is consistent under the most general assumptions. Here we describe the
algorithm and explain how and why it works. The proposed algorithm works in
iterations, on each of which a set of $\kappa$ change point estimates is constructed. The
algorithm then combines the estimates obtained on all the iterations.
On each iteration $j$ the input sequence is partitioned into a grid; the larger $j$, the
finer the grid. The candidate change points are then sought in the segments
of the grid. The single-change-point-estimator $\Phi_{\mathbf{x}}(\cdot,\cdot,\cdot)$ is used to produce the
candidate change points. The sets of candidate change points obtained at all
iterations $j$ are combined with weights that depend on $j$ and on the estimated
performance of these change point candidates. The performance of each set
of change point candidates cannot be evaluated directly; instead, we use the
minimum intra-subsequence distance $\Delta_{\mathbf{x}}(\cdot,\cdot)$ of the segments containing change
point candidates used in that iteration, as an indicator of performance.

Theorem 1 (Algorithm 1 is consistent). Let $\mathbf{x} = X_{1..n}$ be a sequence with $\kappa$
change points denoted $\pi_k$, $k = 1..\kappa$. Denote by $\hat\pi_k$, $k = 1..\kappa$ the estimated change
points as given by Algorithm 1 taking $\mathbf{x}$ and $\kappa$ as inputs. We have

$$\lim_{n \to \infty} \frac{1}{n} |\hat\pi_k - \pi_k| = 0 \quad \text{a.s.}$$

provided the distribution of each segment is stationary ergodic.

The proof is provided in Section 6. Here we explain how and why the
algorithm works.

First observe that the empirical estimate $\hat d(\cdot,\cdot)$ of the distributional distance is consistent; this means
that the empirical distributional distance between a given pair of sequences
converges to the distributional distance between their generating processes.
From this we can show that if a segment $X_{a..b}$ for some $a, b \in 1..n$ whose
length is linear in $n$ does not contain any change points, then its corresponding
intra-subsequence distance $\Delta_{\mathbf{x}}(a, b)$ converges to 0 with increasing $n$. On
the other hand, if there is a single change point $\pi$ within $X_{a..b}$ whose distance
from $a$ and $b$ is linear in $n$, the intra-subsequence distance $\Delta_{\mathbf{x}}(a, b)$ converges to
a non-zero constant. Moreover, in this case the single-change-point-estimator
$\Phi_{\mathbf{x}}(a, b, \alpha)$, $\alpha \in (0,1)$ produces an estimate that converges to $\pi$,
provided that $\pi$ is the only change point within the interval $a - n\alpha..b + n\alpha$.

The algorithm must select $\kappa$ segments of $\mathbf{x}$ whose lengths are linear in $n$;
each of the selected segments must contain a smaller segment that is also of
linear length. There must be a single change point within the selected segment,
and it must be contained within the smaller segment inside. Moreover, the
distance between the starting point of the smaller segment and that of
the selected segment which contains it must be linear in $n$. The same linearity
condition must hold with respect to their end-points. However, with the available
information there is no way to select such segments directly.



Algorithm 1 Estimating $\kappa$ change points for $\kappa \ge 2$
input: $\mathbf{x} = X_{1..n}$, number of change points $\kappa$
initialize: $\eta \leftarrow 0$
for $j = 1..\log n$ do

    Set the step size and weight: $\alpha_j \leftarrow 2^{-j}$, $w_j \leftarrow 2^{-j}$
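    for $t = 1..\kappa + 1$ do
        1. Generate an index-sequence:
            $b_i^{t,j} \leftarrow n\alpha_j \left(i + \frac{t}{\kappa+1}\right)$, $i = 0..\frac{1}{\alpha_j}$
        2. Calculate the performance weight $\gamma(t,j)$:
            for $l = 0..2$ do
                i. Generate intra-distances using $\Delta_{\mathbf{x}}$ given by (7), over the
                   non-overlapping intervals of length $3n\alpha_j$ starting at offset $l$:
                    $d_i \leftarrow \Delta_{\mathbf{x}}(b_{l+3i}^{t,j}, b_{l+3(i+1)}^{t,j})$
                ii. Store the $\kappa$-th highest intra-distance value as $\gamma_l$
            end for
            $\gamma(t,j) \leftarrow \min_{l = 0..2} \gamma_l$
        3. Calculate the change points using $\Phi_{\mathbf{x}}$ given by (8) in the $\kappa$
           segments of highest intra-distance:
            i. $\hat b_k^{t,j} \leftarrow b_i^{t,j}$ s.t. $X_{b_i^{t,j}..b_{i+1}^{t,j}}$ has the $k$-th largest $\Delta_{\mathbf{x}}$, $k = 1..\kappa$, $i \in 0..2^j$
            ii. $\hat\pi_k^{t,j} := \Phi_{\mathbf{x}}(\max\{1, \hat b_k^{t,j} - n\alpha_j\}, \min\{\hat b_k^{t,j} + n\alpha_j, n\}, \alpha_j)$, $k = 1..\kappa$
        4. Update the total sum of the weights:
            $\eta \leftarrow \eta + w_j \gamma(t,j)$
    end for
end for
Calculate the final change points:
    $\hat\pi_k \leftarrow \frac{1}{\eta} \sum_{j=1}^{\log n} \sum_{t=1}^{\kappa+1} w_j \gamma(t,j) \hat\pi_k^{t,j}$, $k = 1..\kappa$
return: $\hat\pi_1, \dots, \hat\pi_\kappa$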



A key observation we make is the following. Consider the partitioning of
$\mathbf{x}$ into $\kappa$ consecutive segments where there exists at least one segment that
contains more than a single change point. Since there are exactly $\kappa$ change
points, within such partitioning of $\mathbf{x}$ there must exist at least one other segment
that does not contain any change points at all. As follows from the asymptotic
consistency of $\hat d(\cdot,\cdot)$ the segment that contains no change points has an
intra-subsequence distance $\Delta_{\mathbf{x}}(\cdot,\cdot)$ that converges to 0. With this observation in
mind, we construct a consistent algorithm as follows.

Given a sequence $\mathbf{x}$, we iterate over $j = 1..\log n$ and at each iteration,
we generate a grid composed of evenly-spaced consecutive segments of length
$n\alpha_j$, where $\alpha_j := 2^{-j}$. The grid is used to generate a set of candidate change
points as follows. Among the segments of the grid, we select $\kappa$ segments of
highest intra-subsequence distance. The single-change-point-estimator $\Phi_{\mathbf{x}}(\cdot,\cdot,\cdot)$
is applied to the segments to produce a candidate for each change point. In this
process, an ideal scenario is when each one of the selected segments of length
$n\alpha_j$ is exactly at the center of a larger segment of length $3n\alpha_j$, where the
only change point within the larger segment is that which is contained in the
smaller segment. In this case, the single-change-point-estimator is guaranteed
to produce asymptotically consistent results. This ideal scenario happens when
every three consecutive segments of the grid contain at most one change point.
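The grid construction described above can be illustrated with the following Python sketch. It assumes the index formula $b_i = n\alpha_j(i + t/(\kappa+1))$ with $\alpha_j = 2^{-j}$ and a shift parameter $t \in 1..\kappa+1$; the function name, the use of integer (floor) arithmetic, and the decision to drop indices that fall beyond $n$ are choices made for this illustration only.

```python
def grid_indices(n, j, t, kappa):
    """Evenly spaced grid indices floor(n * 2^-j * (i + t / (kappa + 1)))."""
    cells = 2 ** j
    denom = cells * (kappa + 1)
    # Integer arithmetic avoids floating-point rounding of the shift t/(kappa+1).
    idx = [n * (i * (kappa + 1) + t) // denom for i in range(cells + 1)]
    return [b for b in idx if b <= n]  # dropping overshoot is an assumption
```

For instance, with `n = 1200`, `j = 2`, `kappa = 2` and `t = 1`, consecutive indices are `n * 2**-j = 300` apart and the grid is shifted by a third of a cell.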

At every iteration, either the ideal condition we are after
holds or it does not. In the former case, as explained earlier, the
change point estimates at this iteration are asymptotically consistent. Recall
that the algorithm iterates over $j = 1..\log n$. Hence, this ideal scenario
occurs from some $j$ on, when $\alpha_j$ is small enough so that every three consecutive
segments contain at most one change point.

Since it is not possible to directly identify such "good" iterations, for each
iteration a performance weight $\gamma(\cdot,\cdot)$ is calculated; it is designed to converge
to zero on the iterations where the ideal scenario does not hold. At the same
time, it converges to a non-zero constant on all the "good" iterations. At a given
iteration on $t, j$, $\gamma(t,j)$ is calculated as follows. First the set of all intervals of
length $3n\alpha_j$ formed by consecutive elements of the index-sequence is partitioned
into three sets of non-overlapping consecutive intervals. In each partition, the
$\kappa$-th highest intra-distance value is stored as $\gamma_l$, $l = 0..2$, and the performance
weight is obtained as $\gamma(t,j) := \min_{l=0..2} \gamma_l$. On the "bad" iterations, at least one
of the three partitions has the property that among every set of $\kappa$ segments in
the partition, there is at least one segment that contains no change points. In
this case, $\Delta_{\mathbf{x}}(\cdot,\cdot)$ corresponding to the segment without a change point converges
to 0. A technical problem occurs when a change point is exactly at the start or
at the end of a segment. To avoid this problem, for every fixed $j$, the process
is repeated $\kappa + 1$ times with distinct starting positions $\frac{n\alpha_j t}{\kappa+1}$, $t = 1..\kappa+1$ for the
grid. This ensures that for every fixed $j$ we have at least one grid such that
none of its segments start or end exactly on a change point. Finally, at each
iteration the change point estimates are combined with two sets of weights:

1. $\gamma(t,j)$ to penalize for small intra-subsequence distance of appropriate
segments. As discussed, $\gamma(t,j)$ converges to zero on the "bad" iterations,
where the candidate estimates are not guaranteed to be asymptotically
consistent.

2. $w_j$ to give precedence to estimates obtained based on longer segments.
Since the number of iterations increases with $n$ there will be some iterations
at which the segments are not long enough to have consistent estimates.
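The weighting scheme described above can be sketched in Python as follows. The sketch assumes that the intra-distances of the three partitions have already been computed and that each partition contains at least $\kappa$ intervals; the function names and the data layout (a list of `(w_j, gamma, estimates)` triples) are hypothetical choices made for this illustration.

```python
def kth_largest(values, k):
    """k-th largest element of a list (k = 1 is the maximum)."""
    return sorted(values, reverse=True)[k - 1]


def performance_weight(partitions, kappa):
    """gamma(t, j): the minimum over the three partitions of the
    kappa-th highest intra-distance value."""
    return min(kth_largest(p, kappa) for p in partitions)


def combine(candidates):
    """Weighted average of candidate change points.

    candidates: list of (w_j, gamma, [estimate for k = 1..kappa]).
    """
    eta = sum(w * g for w, g, _ in candidates)
    kappa = len(candidates[0][2])
    return [
        sum(w * g * est[k] for w, g, est in candidates) / eta
        for k in range(kappa)
    ]
```

An iteration whose performance weight is zero has no influence on the final estimate, whatever candidates it produced.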

Computational complexity. The presented method can be implemented
efficiently. The algorithm is based on empirical estimates $\hat d(\cdot,\cdot)$ of the
distributional distance. While its definition given by (3) involves infinite sums, $\hat d(\cdot,\cdot)$
can be calculated efficiently. Indeed, in (3) all summands corresponding to
$m > \max_{i=1,2} n_i$ equal 0; moreover, all summands corresponding to $l > s_{\min}$
are equal, where

$$s_{\min} := \min_{i, j \in 1..n,\ X_i \ne X_j} |X_i - X_j|$$

corresponds to the partition in which each cell contains at most one point. A
more efficient implementation of the distance can be obtained if the estimate given
by (6) is used instead of $\hat d(\cdot,\cdot)$, setting $m_n = \log n$; in this case, the
consistency results are unaffected, and the computational complexity of calculating
the distance becomes ©(n poly logn). Thus the most naive implementation of 
the algorithm has complexity ©(n^ polylogn). The choice m = logn is further 
justified by the fact that the frequencies of cells in $B^{m,l}$ corresponding to higher
values of $m$ are not consistent estimates of their probabilities (and thus only
add to the error of the estimate); see [131 Hlj for a detailed discussion. 

5 Outlook 

We have presented an asymptotically consistent change point estimation algo- 
rithm for the case where the only assumption on the distributions generating 
the data is that they are stationary ergodic. The number of distributions is 
unknown, but the number of change points is known and supplied to the
algorithm. Among the possible extensions, the first that comes to mind is the case
of an unknown number of change points. As mentioned in the introduction, this
problem has provably no solution in this general setting. Instead of restricting 
the setting, it would be interesting to consider some intermediate formulations. 
One possible formulation is that while the number of change points is unknown, 
the number of distributions generating the data is known. This assumption can 
be natural in some practical applications. For example, the case of just two 
distributions can be interpreted as normal versus abnormal behavior; one can 
imagine a sequence with many change points in this scenario. Another exten- 
sion can be made by analogy to the clustering problem. In clustering, when 
the number of clusters is unknown, a possible goal is to construct a hierarchy 
of clusterings (see e.g. [IS]). A similar formulation may be considered for the 
change point problem. 



6 Proof of Theorem 1

The proof of the theorem relies on several technical statements, i.e. Lemmas 1-3,
whose proofs can be found in the appendix. We introduce the following
additional notation.

Definition 1. For every change point $\pi_k$, $k = 1..\kappa$ we denote by $L^{t,j}(\pi_k)$
and by $R^{t,j}(\pi_k)$ the elements of the index-sequence $b_i^{t,j}$, $i = 1..2^j$ that appear
immediately to the left and to the right of $\pi_k$ respectively, i.e.

$$L^{t,j}(\pi_k) := \max_{b_i^{t,j} \le \pi_k} b_i^{t,j} \quad \text{and} \quad R^{t,j}(\pi_k) := \min_{b_i^{t,j} \ge \pi_k} b_i^{t,j}.$$

Equality corresponds to the case where a change point $\pi_k$ for some $k \in 1..\kappa$ is
exactly at the start or at the end of a segment.

Before we proceed to the proof of the main theorem, we provide the
following outline. First, observe that at a given iteration on $j$ and $t$ the ideal
scenario where it would be possible to have asymptotically consistent estimates
of each one of the change points, is when for every pair of consecutive indices
$b_i^{t,j}$, $b_{i+1}^{t,j}$, $i = 1..2^j - 2$ the sequence $X_{b_i^{t,j}..b_{i+1}^{t,j}}$ has exactly one change point so
that,

1. The indices do not "hit" the change points, i.e.

$$\pi_k \ne b_i^{t,j}, \quad i = 0..2^j, \quad k \in 1..\kappa$$

2. For every pair of consecutive change points, $\pi_k$, $\pi_{k+1}$, $k = 1..\kappa$ we have

$$[L^{t,j}(\pi_k) - n\alpha_j, R^{t,j}(\pi_k) + n\alpha_j] \subset [\pi_{k-1}, \pi_{k+1}]$$

where $\pi_0 := 1$, $\pi_{\kappa+1} := n$.

We show that in Algorithm 1 this ideal scenario occurs at a subset of iterations
on $t \in 1..\kappa+1$ and $j \in 1..n - \kappa$. We further show that the performance weight
$\gamma(t,j)$ corresponding to these good iterations converges to a non-zero constant.
On the other hand we show that $\gamma(t,j)$ converges to 0 on all iterations where the
ideal scenario does not occur. Hence, for every change point the weighted sum
of its estimates obtained at every iteration converges to that of those obtained
at the good iterations. Therefore, the final change point estimates provided by
Algorithm 1 approach their corresponding true values.

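Lemma 1. Let $\mathbf{x} = X_{1..n}$ be generated by a stationary ergodic process $\rho$. For
all $\zeta \in [0,1)$, $\alpha \in (0,1)$ and $T \in \mathbb{N}$ we have

(i) $$\lim_{n \to \infty} \sup_{\substack{b_1 \ge \zeta n,\ b_2 \ge b_1 + \alpha n \\ B \in B^{m,l},\ m, l \in 1..T}} |\nu(X_{b_1..b_2}, B) - \rho(B)| = 0$$

(ii) $$\lim_{n \to \infty} \sup_{\substack{b_1 \ge \zeta n \\ b_2 \ge b_1 + \alpha n}} \Delta_{\mathbf{x}}(b_1, b_2) = 0.$$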




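Lemma 2. Assume that a sequence $\mathbf{x} = X_{1..n}$ has a change point $\pi = \theta n$
for some $\theta \in (0,1)$ so that the segments $X_{1..\pi}$, $X_{\pi..n}$ are generated by two
different processes $\rho$, $\rho'$ respectively. If the distributions $\rho$, $\rho'$ generating the
data are both stationary ergodic then with probability one, for every $\theta \in (0,1)$
and $\zeta \in [0, \min\{\theta, 1 - \theta\})$ we have

(i) $$\lim_{n \to \infty} \sup_{\substack{b \in 1..(\theta - \zeta)n \\ t \in \pi..(1 - \zeta)n}} \hat d\left(X_{b..t}, \frac{\pi - b}{t - b}\rho + \frac{t - \pi}{t - b}\rho'\right) = 0$$

(ii) $$\lim_{n \to \infty} \sup_{\substack{b \in \zeta n..\pi \\ t \in (\theta + \zeta)n..n}} \hat d\left(X_{b..t}, \frac{\pi - b}{t - b}\rho + \frac{t - \pi}{t - b}\rho'\right) = 0$$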

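Lemma 3. Let $\delta$ denote the minimum distance between the distinct distributions
generating the data. Assume that for some $\zeta \in (0,1)$ and some $t \in 1..\kappa+1$ and
$j \in 1..\log n$ we have

$$\inf_{\substack{k = 1..\kappa \\ i = 0..2^j}} |b_i^{t,j} - \pi_k| \ge \zeta n. \qquad (9)$$

Then,
(i) With probability one we have

$$\lim_{n \to \infty} \inf_{k \in 1..\kappa} \Delta_{\mathbf{x}}(L^{t,j}(\pi_k), R^{t,j}(\pi_k)) \ge \delta\zeta.$$

(ii) If additionally we have

$$[L^{t,j}(\pi_k) - n\alpha_j, R^{t,j}(\pi_k) + n\alpha_j] \subset [\pi_{k-1}, \pi_{k+1}] \qquad (10)$$

then with probability one we obtain,

$$\lim_{n \to \infty} \sup_{k \in 1..\kappa} \frac{1}{n} |\Phi_{\mathbf{x}}(L^{t,j}(\pi_k), R^{t,j}(\pi_k), \alpha_j) - \pi_k| = 0.$$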

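Proof of Theorem 1. Fix an $\varepsilon > 0$. There exists some $J_\varepsilon$ such that

$$\sum_{j = J_\varepsilon}^{\infty} w_j \le \varepsilon. \qquad (11)$$

Recall that the algorithm specifies $\alpha_j := 2^{-j}$ for $j = 1..\log n$ and generates a
sequence of evenly-spaced indices $b_i^{t,j}$ where $t \in 1..\kappa+1$. Observe that

$$b_i^{t,j} - b_{i-1}^{t,j} = n\alpha_j, \quad i = 1..2^j. \qquad (12)$$

Define

$$\zeta(t,j) := \min_{\substack{k \in 1..\kappa \\ i \in 0..2^j}} \left|\alpha_j\left(i + \frac{t}{\kappa+1}\right) - \theta_k\right| \qquad (13)$$

for $j = 1..\log n$ and $t \in 1..\kappa+1$. (Note that $\zeta(t,j)$ can also be zero.) We have

$$|b_i^{t,j} - \pi_k| \ge n\zeta(t,j) \qquad (14)$$

for all $i = 0..2^j$. Let $\pi_0 := n\theta_0$ and $\pi_{\kappa+1} := n\theta_{\kappa+1}$ where $\theta_0 := 0$ and $\theta_{\kappa+1} := 1$.
Define $\theta := \min_{k \in 1..\kappa+1} \theta_k - \theta_{k-1}$ and let $J(\theta) := \log\frac{3}{\theta}$. For all $j \ge J(\theta)$ we have

$$\alpha_j \le \frac{\theta}{3}. \qquad (15)$$

Therefore, at every iteration on $j \ge J(\theta)$ and $t \in 1..\kappa+1$, for every change
point $\pi_k$, $k \in 1..\kappa$ we have

$$[L^{t,j}(\pi_k) - n\alpha_j, R^{t,j}(\pi_k) + n\alpha_j] \subset [\pi_{k-1}, \pi_{k+1}] \qquad (16)$$

Take a fixed $\alpha \in (0, \theta/3]$. For every $\theta_k$, $k = 1..\kappa$ we can uniquely define $q_k \in \mathbb{N}$
and $p_k \in [0, \alpha)$ so that

$$\theta_k = q_k\alpha + p_k.$$

Therefore, for any $p \in [0, \alpha)$ with $p \ne p_k$, $k = 1..\kappa$, we have

$$\inf_{\substack{k = 1..\kappa \\ i \in \mathbb{N} \cup \{0\}}} |i\alpha + p - \theta_k| > 0.$$

Clearly, we can only have $\kappa$ distinct residues $p_k$, $k = 1..\kappa$. Therefore, any subset
of $[0, \alpha)$ with $\kappa + 1$ elements contains at least one element $p'$, s.t. $p' \ne p_k$ for all
$k = 1..\kappa$. Recall the definition of $\zeta(t,j)$ given by (13). By the above argument
and noting that $\alpha_j \le \theta/3$ for all $j \ge J(\theta)$ it follows that for every $j \ge J(\theta)$
there exists at least one $t \in 1..\kappa+1$ such that

$$\zeta(t,j) > 0. \qquad (17)$$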
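The counting argument behind (17) can be checked numerically with the following Python sketch, which is an illustration only. It uses exact rational arithmetic and assumes that the grid points are of the form $\alpha(i + t/(\kappa+1))$; the function name `clear_shifts` is hypothetical. A shift $t$ is "clear" when no normalized change point $\theta_k$ coincides with a grid point.

```python
from fractions import Fraction


def clear_shifts(thetas, alpha, kappa):
    """Shifts t in 1..kappa+1 whose grid alpha*(i + t/(kappa+1)) avoids every theta."""
    good = []
    for t in range(1, kappa + 2):
        shift = alpha * Fraction(t, kappa + 1)
        # theta lies on the grid exactly when (theta - shift) is a multiple of alpha.
        if all((theta - shift) % alpha != 0 for theta in thetas):
            good.append(t)
    return good
```

Since the $\kappa + 1$ shifts have pairwise distinct residues modulo $\alpha$ and each change point can block at most one of them, the returned list is never empty.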

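For every $j \in J(\theta)..n - \kappa$, let $T(j) \subset 1..\kappa+1$ denote the set of all iterations
$t \in 1..\kappa+1$ on which (17) holds. Moreover, for $j \in J(\theta)..n - \kappa$ define

$$\zeta(j) := \min_{t \in T(j)} \zeta(t,j)$$

and

$$\zeta_{\min} := \inf_{j \in J(\theta)..J_\varepsilon} \zeta(j). \qquad (18)$$

Note that by definition we have

$$\zeta_{\min} > 0. \qquad (19)$$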

At every iteration on $j$ and $t$ the algorithm specifies a performance weight $\gamma(t,j)$
as follows. First the set of all intervals of length $3n\alpha_j$, formed by consecutive
elements of the index-sequence $b_i^{t,j}$, $i = 0..2^j - 3$ is partitioned into three sets of
non-overlapping intervals. More specifically, let

$$S^{t,j} := \{(b_i^{t,j}, b_{i+3}^{t,j}) : i = 0..2^j - 3\}, \quad j = 1..\log n, \quad t = 1..\kappa+1. \qquad (20)$$

The set $S^{t,j}$ is partitioned into three disjoint subsets $S_l^{t,j}$, $l = 0..2$ where

$$S_l^{t,j} := \{(b_{l+3i}^{t,j}, b_{l+3(i+1)}^{t,j}) : i \ge 0,\ l + 3(i+1) \le 2^j\}. \qquad (21)$$

For every fixed $l = 0..2$, every pair of indices $(b, b') \in S_l^{t,j}$ corresponds to
a segment $X_{b..b'}$ of length $3n\alpha_j$ and the distinct elements of $S_l^{t,j}$ index
non-overlapping segments of $\mathbf{x}$. For every set $S_l^{t,j}$, $l = 0..2$ the intra-distance values
of all the segments $X_{b..b'}$ corresponding to pairs $(b, b') \in S_l^{t,j}$ are calculated and
sorted in decreasing order. The $\kappa$-th highest intra-distance value is stored as
$\gamma_l$, $l = 0..2$. Finally the performance weight is calculated as

$$\gamma(t,j) := \min_{l = 0..2} \gamma_l.$$

Let 5 denote the minimum distance between the distinct distributions generating 
the data. By ((TH), ([TC]). ([T7]) and hence Lemma [5111 for every j e J {9).. J. there 
exists some Ni{j) such that for all n > Ni{j) we have 

■miMt,j)>6aj). (22) 

Moreover, by Lemma 3(ii) there exists some $N_2(j)$ such that for all $n \ge N_2(j)$
we have

$$\sup_{\substack{k \in 1..\kappa \\ t \in T(j)}} \frac{1}{n} |\hat\pi_k^{t,j} - \pi_k| \le \varepsilon. \qquad (23)$$



Therefore we have 

1 ^' 

— E E ^vMt,J)kk~frl^\<e. (24) 

Consider the set of iterations on j > J{9) and t ^ T(j). Since C{t,j) = for all 
t ^ T{j), this means that there exists some i £ 1..2-' — 1 such that foj'"* = tt^ for 
some k £ 1..k. By (rT2t . (TTil) . (TTHl) and hence Lemma [T] from some n on we have 

max{Ax(7rfe - Snaj, TTfe), Ax(7rfc, TTfe -I- Sna^)} < £. 

Thus, for every $j \in J(\theta)..J_\varepsilon$ there exists some $N_3(j)$ such that for all $n \ge N_3(j)$
we have

$$\sup_{t \notin T(j)} \gamma(t,j) \le \varepsilon. \qquad (25)$$

Moreover, for all j — l..J{9) — 1 we have 



Therefore at every iteration on $j \in 1..J(\theta) - 1$ and $t \in 1..\kappa+1$, there exists
some $(b, b') \in S^{t,j}$ such that the segment $X_{b..b'}$ contains more than a single
change point. Since there are exactly $\kappa$ change points, in at least one of
the partitions $S_l^{t,j}$ for some $l \in 0..2$ we have that within any set of $\kappa$ segments
indexed by a subset of $\kappa$ elements of $S_l^{t,j}$, there exists at least one segment that
contains no change points. Therefore, by (IT^ . P^ and hence Lemma [U for 
every j e \..J(&) — 1 there exists some iV(j) such that for all n > N{j) we have 

$$\sup_{t \in 1..\kappa+1} \gamma(t,j) \le \varepsilon. \qquad (26)$$

Let iV' := max N(j) and N" max N,(j).Dc&neN:^maxiN'N"}. 

]=j(e)..Jc 

Recall that as specified by Algorithm 1 we have $\eta := \sum_{j=1}^{\log n} \sum_{t=1}^{\kappa+1} w_j \gamma(t,j)$. Hence
by (22) for all $n \ge N$ we have

77 > wiSai). (27) 

Moreover, observe that for all $k \in 1..\kappa$, $t \in 1..\kappa+1$ and $j \in 1..\log n$ we have

$$|\pi_k - \hat\pi_k^{t,j}| \le n. \qquad (28)$$

We obtain, 

, JW x+i J(e) >t+i 

— im u;j7(t,j)|^fc - TT^'-'I < - 5m «^j7(i,j) 

<£<ii±i)y;'„,<£(^ (29, 

where the first inequality follows from (|28l) the second inequality follows from 
(PS)) . and the last inequality follows from (l?fl) and the fact that X^iii ""^j — 1- 
Similarly, by ([23, (I2Z1) and ^ we obtain 

log f^ / 1 \ 

;i; E i:»,7«.,)l».-*i-'l<^. (30) 

Moreover we have 

logn >c+l _. logn x+1 

widC(l) 

where the first inequality follows from (P7)) and (1^51) . and the second inequality 
follows from ([SS]) and the fact that d(-, •) < 1 so that 7(t,i) < 1 for all t G 
$1..\kappa+1$, $j \in 1..\log n$. Finally we have

^ _. logn>t+l 

-kfc-^fcl < — ^^Wj7(i,j)kfc-4'-' 

J(e)>r+1 J, 

-, Je -, log" X + 1 

' j=J{e)+i t<^rij) ' 3=Je+i t=i 

WidC(l) 
Since the choice of $\varepsilon$ is arbitrary, the statement of the theorem follows. $\square$

Appendix

The following algebraic manipulation of the frequency function $\nu(\cdot,\cdot)$ simply follows from its definition given in
dD); this is used in the proofs of Lemmas 1 and 2. Consider a sequence $x = X_{1..n} \in A^n$, and a subsequence $X_{n_1..n_2}$
of $x$ for some $n_1 < n_2 \in 1..n$. For every $B \in B^{m,l}$, $m, l \in 1..n$ we have

$$\nu(X_{n_1..n_2}, B) = \frac{n_2 - m + 1}{n_2 - n_1 - m + 1}\,\nu(X_{1..n_2}, B) - \frac{n_1 - m + 1}{n_2 - n_1 - m + 1}\,\nu(X_{1..n_1}, B) - \frac{1}{n_2 - n_1 - m + 1} \sum_{i=n_1-m+2}^{n_1} I\{X_{i..i+m} \in B\} \tag{32}$$

where

$$\sum_{i=n_1-m+2}^{n_1} \frac{I\{X_{i..i+m} \in B\}}{n_2 - n_1 - m + 1} \leq \frac{m-1}{n_2 - n_1 - m + 1}.$$
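The decomposition in (32) can be checked numerically on a small sequence. The sketch below is only an illustration under simplifying assumptions that are not taken from the text: it uses 0-based, half-open Python slices `x[a:b]` in place of the 1-based $X_{n_1..n_2}$, and it replaces the set $B$ by a single word over a finite alphabet. The function names (`nu`, `nu_via_prefixes`, `window_count`) are hypothetical.

```python
from fractions import Fraction


def window_count(x, m, word, start, stop):
    """Count windows x[i:i+m] equal to `word` with start index i in [start, stop)."""
    return sum(1 for i in range(start, stop) if tuple(x[i:i + m]) == tuple(word))


def nu(x, a, b, word):
    """Frequency of `word` among the length-m windows that lie fully inside x[a:b]."""
    m = len(word)
    total = (b - a) - m + 1  # number of windows inside x[a:b]
    return Fraction(window_count(x, m, word, a, b - m + 1), total)


def nu_via_prefixes(x, a, b, word):
    """Same frequency, recovered from the two prefixes x[0:b] and x[0:a].

    Requires a >= m and b - a >= m so that every frequency is well defined.
    """
    m = len(word)
    total = (b - a) - m + 1
    count_b = (b - m + 1) * nu(x, 0, b, word)  # windows inside x[0:b]
    count_a = (a - m + 1) * nu(x, 0, a, word)  # windows inside x[0:a]
    # The m - 1 windows that start before index a but end at or after it.
    straddling = window_count(x, m, word, a - m + 1, a)
    return (count_b - count_a - straddling) / total
```

The straddling term involves exactly $m - 1$ windows, which is where the bound $(m-1)/(n_2 - n_1 - m + 1)$ on the correction term comes from.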

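Lemma 1. Let $x = X_{1..n}$ be generated by a stationary ergodic process $\rho$. For all $\zeta \in [0,1)$ and $\alpha \in (0,1)$ we
have

(i) for every $T \in \mathbb{N}$,
$$\lim_{n\to\infty} \sup_{\substack{b_1 \geq \zeta n \\ b_2 \geq b_1 + \alpha n \\ B \in B^{m,l},\ m,l \in 1..T}} |\nu(X_{b_1..b_2}, B) - \rho(B)| = 0,$$

(ii)
$$\lim_{n\to\infty} \sup_{\substack{b_1 \geq \zeta n \\ b_2 \geq b_1 + \alpha n}} \Delta_x(b_1, b_2) = 0.$$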

Proof. (i) Fix $\varepsilon > 0$, $\alpha \in (0,1)$ and $\zeta \in [0,1)$. For each $m, l \in 1..n$ we can find a finite subset $S^{m,l}$ of $B^{m,l}$ such
that $\rho(S^{m,l}) \geq 1 - \varepsilon$. For every $B \in S^{m,l}$, $m, l \in 1..n$ there exists some $N(B)$ such that for all $n \geq N(B)$ with
probability one we have

$$\sup_{b \geq \zeta n} |\nu(X_{1..b}, B) - \rho(B)| \leq \varepsilon. \tag{33}$$

Fix some $T \in \mathbb{N}$. For all $n \geq \frac{T}{\varepsilon\alpha}$ and $m \in 1..T$ we have $\frac{m}{\alpha n} \leq \varepsilon$. Let

$$N := \max_{B \in S^{m,l},\ m,l \in 1..T} N(B).$$

For all $n \geq \max\{N, \frac{T}{\varepsilon\alpha}\}$ we obtain,



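$$\begin{aligned}
&\sup_{\substack{b_1 \geq \zeta n \\ b_2 \geq b_1 + \alpha n \\ B \in B^{m,l},\ m,l \in 1..T}} |\nu(X_{b_1..b_2}, B) - \rho(B)| \\
&\quad\leq \sup_{\substack{b_1 \geq \zeta n \\ b_2 \geq b_1 + \alpha n \\ B \in B^{m,l},\ m,l \in 1..T}} \Big|\frac{b_2 - b_1 - m + 1}{b_2 - b_1}\,\nu(X_{b_1..b_2}, B) - \rho(B)\Big| + \frac{m-1}{b_2 - b_1} && (34) \\
&\quad\leq \sup_{\substack{b_1 \geq \zeta n \\ b_2 \geq b_1 + \alpha n \\ B \in B^{m,l},\ m,l \in 1..T}} \Big|\frac{b_2 - m + 1}{b_2 - b_1}\,\nu(X_{1..b_2}, B) - \frac{b_1 - m + 1}{b_2 - b_1}\,\nu(X_{1..b_1}, B) - \rho(B)\Big| + \frac{2(m-1)}{b_2 - b_1} && (35) \\
&\quad\leq \sup_{\substack{b_1 \geq \zeta n \\ b_2 \geq b_1 + \alpha n \\ B \in B^{m,l},\ m,l \in 1..T}} \Big|\frac{b_2}{b_2 - b_1}\,\nu(X_{1..b_2}, B) - \frac{b_1}{b_2 - b_1}\,\nu(X_{1..b_1}, B) - \rho(B)\Big| + \frac{4(m-1)}{b_2 - b_1} && (36) \\
&\quad\leq \sup_{\substack{b_1 \geq \zeta n \\ b_2 \geq b_1 + \alpha n \\ B \in B^{m,l},\ m,l \in 1..T}} \frac{b_2}{b_2 - b_1}\,|\nu(X_{1..b_2}, B) - \rho(B)| + \frac{b_1}{b_2 - b_1}\,|\nu(X_{1..b_1}, B) - \rho(B)| + \frac{4(m-1)}{b_2 - b_1} \\
&\quad\leq 2\varepsilon\Big(2 + \frac{1}{\alpha}\Big) && (37)
\end{aligned}$$

where (34) and (36) follow from the fact that $\nu(\cdot,\cdot) \leq 1$, (35) follows from (32), and (37) follows from (33). This
proves (i).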

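(ii) Fix $\varepsilon > 0$, $\alpha \in (0,1)$ and $\zeta \in [0,1)$. There exists some $T \in \mathbb{N}$ such that

$$\sum_{m,l=T}^{\infty} w_m w_l \leq \varepsilon. \tag{38}$$

Define $c := \frac{b_1 + b_2}{2}$. By Lemma 1(i) there exists some $N$ such that for all $n \geq N$ we have

$$\sup_{\substack{i = 1,2 \\ b_1 \geq \zeta n \\ B \in B^{m,l},\ m,l \in 1..T}} |\nu(X_{\min\{b_i,c\}..\max\{b_i,c\}}, B) - \rho(B)| \leq \varepsilon. \tag{39}$$

Recall the definition of $\Delta(\cdot,\cdot)$ given in (7). From (38) and (39) for all $n \geq N$ we have

$$\begin{aligned}
\sup_{\substack{b_1 \geq \zeta n \\ b_2 \geq b_1 + \alpha n}} \Delta_x(b_1, b_2) &= \sup_{\substack{b_1 \geq \zeta n \\ b_2 \geq b_1 + \alpha n}} d(X_{b_1..c}, X_{c..b_2}) \\
&\leq \sup_{\substack{b_1 \geq \zeta n \\ b_2 \geq b_1 + \alpha n}} \sum_{m,l=1}^{T} w_m w_l \sum_{B \in B^{m,l}} |\nu(X_{b_1..c}, B) - \rho(B)| + |\nu(X_{c..b_2}, B) - \rho(B)| + \varepsilon \\
&\leq 3\varepsilon
\end{aligned}$$

and (ii) follows. □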
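The truncation step used in (38) can be illustrated with a simplified stand-in for the distance $d$. The sketch below makes several assumptions that are not taken from this section: a finite alphabet (so the sets $B$ become words of length $m$), a single weight index instead of the double sum over $m$ and $l$, and the weights $w_j = 1/(j(j+1))$. The names `truncated_distance`, `word_freqs` and `tail_weight` are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def w(j):
    """Assumed weights w_j = 1/(j(j+1)); they sum to 1 over j = 1, 2, ..."""
    return Fraction(1, j * (j + 1))


def word_freqs(x, m):
    """Empirical frequencies of all length-m words occurring in x."""
    total = len(x) - m + 1
    counts = Counter(tuple(x[i:i + m]) for i in range(total))
    return {word: Fraction(c, total) for word, c in counts.items()}


def truncated_distance(x, y, T):
    """Weighted sum over word lengths m <= T of the L1 gap between word frequencies."""
    dist = Fraction(0)
    for m in range(1, T + 1):
        fx, fy = word_freqs(x, m), word_freqs(y, m)
        for word in set(fx) | set(fy):
            dist += w(m) * abs(fx.get(word, 0) - fy.get(word, 0))
    return dist


def tail_weight(T):
    """Total weight of the discarded lengths m > T; the sum telescopes to 1/(T+1)."""
    return Fraction(1, T + 1)
```

Because the discarded weight shrinks like $1/(T+1)$ in this simplified setting, a finite $T$ can always be chosen so that the ignored tail contributes at most $\varepsilon$, which is the role (38) plays in the proof.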



Lemma 2. Assume that a sequence $x = X_{1..n}$ has a change point $\pi = \theta n$ for some $\theta \in (0,1)$ so that the segments
$X_{1..\pi}$, $X_{\pi..n}$ are generated by two different processes $\rho$, $\rho'$ respectively. If the distributions $\rho$, $\rho'$ generating the
data are both stationary ergodic then with probability one, for every $\theta \in (0,1)$ and $\zeta \in [0, \min\{\theta, 1-\theta\})$ we have

(i)
$$\lim_{n\to\infty} \sup_{\substack{b \in 1..(\theta-\zeta)n \\ t \in \pi..(1-\zeta)n}} d\Big(X_{b..t},\ \frac{\pi - b}{t - b}\rho + \frac{t - \pi}{t - b}\rho'\Big) = 0,$$

(ii)
$$\lim_{n\to\infty} \sup_{\substack{b \in \zeta n..\pi \\ t \in (\theta+\zeta)n..n}} d\Big(X_{b..t},\ \frac{\pi - b}{t - b}\rho + \frac{t - \pi}{t - b}\rho'\Big) = 0.$$

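Proof. Fix $\varepsilon > 0$, $\theta \in (0,1)$, $\zeta \in [0, \min\{\theta, 1-\theta\})$. There exists some $T \in \mathbb{N}$ such that

$$\sum_{m,l=T}^{\infty} w_m w_l \leq \varepsilon. \tag{40}$$

To prove Lemma 2(i) we proceed as follows.

By Lemma 1(i) there exists some $N$ such that for all $n \geq N$ we have

$$\sup_{\substack{b \in 1..(\theta-\zeta)n \\ B \in B^{m,l},\ m,l \in 1..T}} |\nu(X_{b..\pi}, B) - \rho(B)| \leq \varepsilon \tag{41}$$

$$\sup_{\substack{t \in \pi..(1-\zeta)n \\ B \in B^{m,l},\ m,l \in 1..T}} |\nu(X_{t..n}, B) - \rho'(B)| \leq \varepsilon. \tag{42}$$

Note that $t - b \geq \zeta n$ for all $b \in 1..(\theta-\zeta)n$, $t \in \pi..(1-\zeta)n$. Therefore, we have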



t-h- C 
Tnci T h i^ ^ in ^ r\ri 



" <7- (43) 



Moreover for all $n \geq \frac{T}{\varepsilon\zeta}$, $m \in 1..T$, $b \in 1..(\theta-\zeta)n$ and $t \in \pi..(1-\zeta)n$ we have

$$\frac{m-1}{t-b} \leq \frac{m}{\zeta n} \leq \varepsilon. \tag{44}$$

Using the decomposition given in (32) we obtain the following bound for all $b \in 1..(\theta-\zeta)n$, $t \in \pi..(1-\zeta)n$ and
all $B \in B^{m,l}$, $m, l \in 1..T$.

$$\begin{aligned}
&\frac{t - \pi - m + 1}{t - b}\,|\nu(X_{\pi..t}, B) - \rho'(B)| \\
&\quad\leq \Big|\frac{n - \pi - m + 1}{t - b}\big(\nu(X_{\pi..n}, B) - \rho'(B)\big) - \frac{n - t - m + 1}{t - b}\big(\nu(X_{t..n}, B) - \rho'(B)\big)\Big| + \frac{m-1}{t-b} \\
&\quad\leq \frac{n - \pi - m + 1}{t - b}\,|\nu(X_{\pi..n}, B) - \rho'(B)| + \frac{n - t - m + 1}{t - b}\,|\nu(X_{t..n}, B) - \rho'(B)| + \frac{m-1}{t-b}. && (45)
\end{aligned}$$

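Let $N' := \max\{N, \frac{T}{\varepsilon\zeta}\}$. For all $n \geq N'$ we have

$$\begin{aligned}
&\sup_{\substack{b \in 1..(\theta-\zeta)n \\ t \in \pi..(1-\zeta)n \\ B \in B^{m,l},\ m,l \in 1..T}} \Big|\nu(X_{b..t}, B) - \frac{\pi - b}{t - b}\rho(B) - \frac{t - \pi}{t - b}\rho'(B)\Big| \\
&\quad\leq \sup_{\substack{b \in 1..(\theta-\zeta)n \\ t \in \pi..(1-\zeta)n \\ B \in B^{m,l},\ m,l \in 1..T}} \Big|\nu(X_{b..t}, B) - \frac{\pi - b - m + 1}{t - b}\rho(B) - \frac{t - \pi - m + 1}{t - b}\rho'(B)\Big| + \frac{2(m-1)}{t-b} && (46) \\
&\quad\leq \sup_{\substack{b \in 1..(\theta-\zeta)n \\ t \in \pi..(1-\zeta)n \\ B \in B^{m,l},\ m,l \in 1..T}} \frac{\pi - b - m + 1}{t - b}\,|\nu(X_{b..\pi}, B) - \rho(B)| + \frac{t - \pi - m + 1}{t - b}\,|\nu(X_{\pi..t}, B) - \rho'(B)| + \frac{4(m-1)}{t-b} && (47) \\
&\quad\leq \sup_{\substack{b \in 1..(\theta-\zeta)n \\ t \in \pi..(1-\zeta)n \\ B \in B^{m,l},\ m,l \in 1..T}} \frac{\pi - b - m + 1}{t - b}\,|\nu(X_{b..\pi}, B) - \rho(B)| + \frac{n - \pi - m + 1}{t - b}\,|\nu(X_{\pi..n}, B) - \rho'(B)| \\
&\qquad\qquad + \frac{n - t - m + 1}{t - b}\,|\nu(X_{t..n}, B) - \rho'(B)| + \frac{5(m-1)}{t-b} && (48) \\
&\quad\leq \varepsilon\Big(5 + \frac{3}{\zeta}\Big) && (49)
\end{aligned}$$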
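where (46) and (47) follow from the fact that $\rho(\cdot) \leq 1$ and $\nu(\cdot,\cdot) \leq 1$, (48) follows from (45) and the last
inequality follows from (41), (42), (43) and (44).

Finally by (40) and (49) for all $n \geq N'$ we obtain

$$\begin{aligned}
&\sup_{\substack{b \in 1..(\theta-\zeta)n \\ t \in \pi..(1-\zeta)n}} d\Big(X_{b..t},\ \frac{\pi - b}{t - b}\rho + \frac{t - \pi}{t - b}\rho'\Big) \\
&\quad\leq \sup_{\substack{b \in 1..(\theta-\zeta)n \\ t \in \pi..(1-\zeta)n}} \sum_{m,l=1}^{T} w_m w_l \sum_{B \in B^{m,l}} \Big|\nu(X_{b..t}, B) - \frac{\pi - b}{t - b}\rho(B) - \frac{t - \pi}{t - b}\rho'(B)\Big| + \varepsilon \\
&\quad\leq 3\varepsilon\Big(2 + \frac{1}{\zeta}\Big)
\end{aligned}$$

and Lemma 2(i) follows. The proof of Lemma 2(ii) is analogous. □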
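The mixture weights $\frac{\pi - b}{t - b}$ and $\frac{t - \pi}{t - b}$ in Lemma 2 have an exact empirical counterpart in the simplest case of single-symbol frequencies ($m = 1$), where no window straddles the change point. The sketch below illustrates only that special case; it assumes 0-based, half-open slices and a finite alphabet, and the function names are hypothetical.

```python
from fractions import Fraction


def symbol_freq(x, a, b, symbol):
    """Empirical frequency of `symbol` in the segment x[a:b]."""
    return Fraction(sum(1 for v in x[a:b] if v == symbol), b - a)


def mixture_freq(x, b, pi, t, symbol):
    """Length-weighted mixture of the frequencies on x[b:pi] and x[pi:t].

    Requires b < pi < t so that both sub-segments are non-empty.
    """
    left_weight = Fraction(pi - b, t - b)
    right_weight = Fraction(t - pi, t - b)
    return (left_weight * symbol_freq(x, b, pi, symbol)
            + right_weight * symbol_freq(x, pi, t, symbol))
```

For longer words ($m > 1$) the identity holds only up to the boundary terms of order $(m-1)/(t-b)$ that appear in (46) to (48).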

Lemma 3. Let $\delta$ denote the minimum distance between the distinct distributions generating the data. Assume
that for some $\zeta \in (0,1)$ and some $t \in 1..\kappa+1$ and $j \in 1..\log n$ we have

$$\inf_{\substack{k = 1..\kappa \\ i = 0..2^j}} |b_i^{t,j} - \pi_k| \geq \zeta n. \tag{50}$$

Then,

(i) With probability one we have

$$\lim_{n\to\infty} \inf_{k \in 1..\kappa} \Delta_x(L^{t,j}(\pi_k), R^{t,j}(\pi_k)) \geq \delta\zeta.$$

(ii) If additionally we have

$$[L^{t,j}(\pi_k) - n\alpha_j,\ R^{t,j}(\pi_k) + n\alpha_j] \subseteq [\pi_{k-1}, \pi_{k+1}] \tag{51}$$

then with probability one we obtain,

$$\lim_{n\to\infty} \sup_{k \in 1..\kappa} \frac{1}{n}\,|\Phi_x(L^{t,j}(\pi_k), R^{t,j}(\pi_k), \alpha_j) - \pi_k| = 0.$$

Proof. (i). Fix some $k \in 1..\kappa$. For simplicity of notation, let $l_k$ and $r_k$ denote $L^{t,j}(\pi_k)$ and $R^{t,j}(\pi_k)$ respectively.
Define $c_k := \frac{l_k + r_k}{2}$. Following the definition of $\Delta(\cdot,\cdot)$ given by (7) we have

$$\Delta_x(l_k, r_k) := d(X_{l_k..c_k}, X_{c_k..r_k}).$$

To prove Lemma 3(i) we show that from some $n$ on with probability 1 we have

$$d(X_{l_k..c_k}, X_{c_k..r_k}) \geq \delta\zeta. \tag{52}$$

To prove (52) for the case where $\pi_k \leq c_k$, $k = 1..\kappa$ we proceed as follows.

As specified by the algorithm the difference between $l_k$ and $r_k$ is linear in $n$ so that for all $k \in 1..\kappa$ we have

$$r_k - l_k \geq n\alpha_j. \tag{53}$$

Hence, it is easy to see that

$$|\pi_{k+1} - c_k| \geq \Big(\zeta + \frac{\alpha_j}{2}\Big) n. \tag{54}$$

Fix $\varepsilon > 0$. By (53), (54) and hence Lemma 1 there exists some $N_1$ such that for all $n \geq N_1$ we have

$$d(X_{c_k..r_k}, \rho_{k+1}) \leq \varepsilon. \tag{55}$$
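By Lemma 2 there exists some $N_2$ such that for all $n \geq N_2$ we have

$$d\Big(X_{l_k..c_k},\ \frac{\pi_k - l_k}{c_k - l_k}\rho_k + \frac{c_k - \pi_k}{c_k - l_k}\rho_{k+1}\Big) \leq \varepsilon. \tag{56}$$

By (50) we have

$$\frac{\pi_k - l_k}{c_k - l_k} \geq \frac{\pi_k - l_k}{n} \geq \zeta. \tag{57}$$

Moreover we have the following lower bound on $d\big(\rho_{k+1},\ \frac{\pi_k - l_k}{c_k - l_k}\rho_k + \frac{c_k - \pi_k}{c_k - l_k}\rho_{k+1}\big)$:

$$\begin{aligned}
d\Big(\rho_{k+1},\ \frac{\pi_k - l_k}{c_k - l_k}\rho_k + \frac{c_k - \pi_k}{c_k - l_k}\rho_{k+1}\Big)
&= \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} \Big|\rho_{k+1}(B) - \frac{\pi_k - l_k}{c_k - l_k}\rho_k(B) - \frac{c_k - \pi_k}{c_k - l_k}\rho_{k+1}(B)\Big| \\
&= \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} \frac{\pi_k - l_k}{c_k - l_k}\,|\rho_{k+1}(B) - \rho_k(B)| \\
&= \frac{\pi_k - l_k}{c_k - l_k}\, d(\rho_{k+1}, \rho_k) \geq \delta\zeta && (58)
\end{aligned}$$

where the inequality follows from (57) and the definition of $\delta$ as the minimum distance between the distributions.
Finally we obtain,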

$$\begin{aligned}
\Delta_x(l_k, r_k) &= d(X_{l_k..c_k}, X_{c_k..r_k}) \\
&\geq d(X_{l_k..c_k}, \rho_{k+1}) - d(X_{c_k..r_k}, \rho_{k+1}) \\
&\geq d\Big(\rho_{k+1},\ \frac{\pi_k - l_k}{c_k - l_k}\rho_k + \frac{c_k - \pi_k}{c_k - l_k}\rho_{k+1}\Big) - d\Big(X_{l_k..c_k},\ \frac{\pi_k - l_k}{c_k - l_k}\rho_k + \frac{c_k - \pi_k}{c_k - l_k}\rho_{k+1}\Big) - d(X_{c_k..r_k}, \rho_{k+1}) \\
&\geq d\Big(\rho_{k+1},\ \frac{\pi_k - l_k}{c_k - l_k}\rho_k + \frac{c_k - \pi_k}{c_k - l_k}\rho_{k+1}\Big) - 2\varepsilon \\
&\geq \delta\zeta - 2\varepsilon && (59)
\end{aligned}$$

where the first and second inequalities follow from applying the triangle inequality on $d(\cdot,\cdot)$, the third inequality
follows from (55) and (56) and the last inequality follows from (58). Since (59) holds for every $\varepsilon > 0$, this proves
(52) in the case where $\pi_k \leq c_k$. The proof for the case where $\pi_k > c_k$ is analogous. Since (52) holds for every
$k \in 1..\kappa$, Lemma 3(i) follows.
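The middle equality in (58) rests on a one-line cancellation that may be worth spelling out. Writing $\lambda$ for the left mixture weight (a shorthand introduced here for illustration, not notation from the text), the step can be sketched as follows.

```latex
% Shorthand (an assumption for this illustration only):
%   \lambda := (\pi_k - l_k)/(c_k - l_k), so that 1 - \lambda = (c_k - \pi_k)/(c_k - l_k).
\begin{align*}
\rho_{k+1}(B) - \lambda\rho_k(B) - (1-\lambda)\rho_{k+1}(B)
  &= \lambda\rho_{k+1}(B) - \lambda\rho_k(B) \\
  &= \lambda\big(\rho_{k+1}(B) - \rho_k(B)\big),
\end{align*}
% Taking absolute values and summing over B, m and l with weights w_m w_l gives
\begin{equation*}
d\big(\rho_{k+1},\ \lambda\rho_k + (1-\lambda)\rho_{k+1}\big)
  = \lambda\, d(\rho_{k+1}, \rho_k) \;\geq\; \zeta\delta ,
\end{equation*}
% where the last step uses \lambda \geq \zeta from (57) and d(\rho_{k+1}, \rho_k) \geq \delta.
```

In words, the distance from one component to a mixture scales linearly with the weight of the other component, so a change point that sits at least $\zeta n$ away from $l_k$ keeps the two half-windows at distance at least $\delta\zeta$ in the limit.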
(ii). Fix some $k \in 1..\kappa$. Following the definition of $\Phi(\cdot,\cdot,\cdot)$ given by (5) we have

$$\Phi(l_k - n\alpha_j,\ r_k + n\alpha_j,\ \alpha_j) := \operatorname*{argmax}_{l' \in l_k..r_k} d(X_{l_k - n\alpha_j..l'},\ X_{l'..r_k + n\alpha_j}).$$

To prove Lemma 3(ii) it suffices to show that for every $\beta \in (0,1)$ with probability 1 from some $n$ on we have for
all $l' \in l_k..(1-\beta)\pi_k \cup \pi_k(1+\beta)..r_k$,

$$d(X_{l_k - n\alpha_j..l'},\ X_{l'..r_k + n\alpha_j}) < d(X_{l_k - n\alpha_j..\pi_k},\ X_{\pi_k..r_k + n\alpha_j}). \tag{60}$$

To prove (60) for $l' \in l_k..(1-\beta)\pi_k$ we proceed as follows.

Fix some $\beta \in (0,1)$ and $\varepsilon > 0$. First note that for all $l' \in l_k..(1-\beta)\pi_k$ we have

l"'^' >/3- (61) 

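Note that by (51) the sequence $X_{l_k - n\alpha_j..r_k}$ is a subsequence of $X_{\pi_{k-1}..\pi_{k+1}}$. By Lemma 1 there exists some $N_1$
such that for all $n \geq N_1$ we have

$$\sup_{l' \in l_k..(1-\beta)\pi_k} d(X_{l_k - n\alpha_j..l'},\ \rho_k) \leq \varepsilon. \tag{62}$$

Similarly, by Lemma 1 there exists some $N_2$ such that for all $n \geq N_2$ we have

$$d(X_{\pi_k..r_k + n\alpha_j},\ \rho_{k+1}) \leq \varepsilon. \tag{63}$$

Moreover, by Lemma 2 there exists some $N_3$ such that for all $n \geq N_3$ we have

$$\sup_{l' \in l_k..(1-\beta)\pi_k} d\Big(X_{l'..r_k + n\alpha_j},\ \frac{\pi_k - l'}{r_k + n\alpha_j - l'}\rho_k + \frac{r_k + n\alpha_j - \pi_k}{r_k + n\alpha_j - l'}\rho_{k+1}\Big) \leq \varepsilon. \tag{64}$$

Let $N := \max_{i = 1..3} N_i$. By (62), (63) and the subsequent application of the triangle inequality on $d(\cdot,\cdot)$ we obtain
the following lower bound for all $n \geq N$.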

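$$\begin{aligned}
\inf_{l' \in l_k..(1-\beta)\pi_k} d(X_{l_k - n\alpha_j..\pi_k},\ X_{\pi_k..r_k + n\alpha_j})
&\geq \inf_{l' \in l_k..(1-\beta)\pi_k} d(X_{l_k - n\alpha_j..\pi_k},\ \rho_{k+1}) - d(X_{\pi_k..r_k + n\alpha_j},\ \rho_{k+1}) \\
&\geq \inf_{l' \in l_k..(1-\beta)\pi_k} d(X_{l_k - n\alpha_j..\pi_k},\ \rho_{k+1}) - \varepsilon \\
&\geq \inf_{l' \in l_k..(1-\beta)\pi_k} d(\rho_k, \rho_{k+1}) - d(\rho_k,\ X_{l_k - n\alpha_j..\pi_k}) - \varepsilon \\
&\geq \inf_{l' \in l_k..(1-\beta)\pi_k} d(\rho_k, \rho_{k+1}) - 2\varepsilon. && (65)
\end{aligned}$$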

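By (62), (64) and applying the triangle inequality on $d(\cdot,\cdot)$ we obtain the following upper bound for all $n \geq N$.

$$\begin{aligned}
&\sup_{l' \in l_k..(1-\beta)\pi_k} d(X_{l_k - n\alpha_j..l'},\ X_{l'..r_k + n\alpha_j}) \\
&\quad\leq \sup_{l' \in l_k..(1-\beta)\pi_k} d(X_{l_k - n\alpha_j..l'},\ \rho_k) + d(\rho_k,\ X_{l'..r_k + n\alpha_j}) \\
&\quad\leq \sup_{l' \in l_k..(1-\beta)\pi_k} d(\rho_k,\ X_{l'..r_k + n\alpha_j}) + \varepsilon \\
&\quad\leq \sup_{l' \in l_k..(1-\beta)\pi_k} d\Big(\rho_k,\ \frac{\pi_k - l'}{r_k + n\alpha_j - l'}\rho_k + \frac{r_k + n\alpha_j - \pi_k}{r_k + n\alpha_j - l'}\rho_{k+1}\Big) \\
&\qquad\qquad + d\Big(X_{l'..r_k + n\alpha_j},\ \frac{\pi_k - l'}{r_k + n\alpha_j - l'}\rho_k + \frac{r_k + n\alpha_j - \pi_k}{r_k + n\alpha_j - l'}\rho_{k+1}\Big) + \varepsilon \\
&\quad\leq \sup_{l' \in l_k..(1-\beta)\pi_k} d\Big(\rho_k,\ \frac{\pi_k - l'}{r_k + n\alpha_j - l'}\rho_k + \frac{r_k + n\alpha_j - \pi_k}{r_k + n\alpha_j - l'}\rho_{k+1}\Big) + 2\varepsilon. && (66)
\end{aligned}$$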

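We also have

$$\begin{aligned}
&d(\rho_k, \rho_{k+1}) - d\Big(\rho_k,\ \frac{\pi_k - l'}{r_k + n\alpha_j - l'}\rho_k + \frac{r_k + n\alpha_j - \pi_k}{r_k + n\alpha_j - l'}\rho_{k+1}\Big) \\
&\quad= \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} |\rho_k(B) - \rho_{k+1}(B)| - \Big|\rho_k(B) - \frac{\pi_k - l'}{r_k + n\alpha_j - l'}\rho_k(B) - \frac{r_k + n\alpha_j - \pi_k}{r_k + n\alpha_j - l'}\rho_{k+1}(B)\Big| \\
&\quad= \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} |\rho_k(B) - \rho_{k+1}(B)| - \frac{r_k + n\alpha_j - \pi_k}{r_k + n\alpha_j - l'}\,|\rho_k(B) - \rho_{k+1}(B)| \\
&\quad= \frac{\pi_k - l'}{r_k + n\alpha_j - l'} \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in B^{m,l}} |\rho_k(B) - \rho_{k+1}(B)| \geq \beta\delta. && (67)
\end{aligned}$$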



where the inequality follows from (61) and the definition of $\delta$ as the minimum distance between the distributions
generating the data. Finally, from (65), (66) and (67) for all $n \geq N$ we obtain,

$$\inf_{l' \in l_k..(1-\beta)\pi_k} d(X_{l_k - n\alpha_j..\pi_k},\ X_{\pi_k..r_k + n\alpha_j}) - d(X_{l_k - n\alpha_j..l'},\ X_{l'..r_k + n\alpha_j}) \geq \beta\delta - 4\varepsilon. \tag{68}$$

Since (68) holds for every $\varepsilon > 0$, this proves (60) for $l' \in l_k..(1-\beta)\pi_k$. The proof for the case where $l' \in
(1+\beta)\pi_k..r_k$ is analogous. Since (60) holds for every $k \in 1..\kappa$, the statement of Lemma 3(ii) follows. □
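The argmax in the definition of $\Phi$ used above can be pictured with a small sketch. This is not the authors' implementation: it assumes a finite alphabet, uses the L1 gap between single-symbol frequencies as a crude stand-in for the distributional distance $d$, and uses 0-based, half-open slices. The names `locate_change`, `lo`, `hi` and `margin` are hypothetical, with `margin` playing the role of $n\alpha_j$.

```python
from fractions import Fraction


def symbol_dist(x, a, b, alphabet):
    """Empirical single-symbol distribution of the segment x[a:b]."""
    return {s: Fraction(sum(1 for v in x[a:b] if v == s), b - a) for s in alphabet}


def l1_gap(p, q):
    """L1 distance between two distributions over the same alphabet."""
    return sum(abs(p[s] - q[s]) for s in p)


def locate_change(x, lo, hi, margin):
    """Return the split in lo..hi that maximises the gap between the two sides.

    The left side is x[lo - margin : split], the right side is
    x[split : hi + margin]; margin >= 1 keeps both sides non-empty.
    """
    alphabet = sorted(set(x))
    start, stop = lo - margin, hi + margin
    best_split, best_score = None, None
    for split in range(lo, hi + 1):
        score = l1_gap(symbol_dist(x, start, split, alphabet),
                       symbol_dist(x, split, stop, alphabet))
        if best_score is None or score > best_score:
            best_split, best_score = split, score
    return best_split
```

On a sequence made of a block of one symbol followed by a block of another, any split away from the true boundary mixes the two blocks on one side and lowers the gap, which mirrors the separation between (65) and (66) that the proof quantifies in (68).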